Taotao Tan
Euclidean distance is simply the geometric distance between two points. Recall the basic Pythagorean formula:
$$a^2 + b^2 = c^2$$
where $a$ and $b$ are the legs of a right triangle and $c$ is the hypotenuse. Specifically, $a^2$ is the squared difference between the two points in the first dimension, and $b^2$ is the squared difference in the second dimension. To get the Euclidean distance $c$ we take $\sqrt{c^2}$, or simply $\sqrt{a^2 + b^2}$.
To express Euclidean distance in more dimensions, we use the same formula and keep adding one squared term per dimension, e.g. $a^2 + b^2 + c^2 = d^2$. It now becomes more convenient to change our notation a bit. Let's say we have $n$ dimensions, and let's call them $D_1, D_2, \ldots, D_n$. Now for any two points $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ in $n$-dimensional space, we can write the Euclidean distance as:
$$d_{xy} = \sqrt{\sum_{i=1}^n (y_i - x_i)^2}$$
To find the Euclidean distance between GeneA and GeneB, let's refer to the value of GeneA in Sample1 as $A_1$ and the value of GeneA in Sample2 as $A_2$; let's call the value of GeneB in Sample1 $B_1$ and in Sample2 $B_2$; and so on.
The Euclidean distance between GeneA and GeneB can now be calculated using the following formula:
$$d_{AB} = \sqrt{(B_1 - A_1)^2 + (B_2 - A_2)^2 + (B_3 - A_3)^2 + (B_4 - A_4)^2} = \sqrt{\sum_{i = 1}^{4}{(B_i - A_i)^2}}$$
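As a quick numerical check of this formula, here is the calculation in NumPy with made-up expression values for the four samples (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical expression values for GeneA and GeneB across 4 samples
A = np.array([2.0, 4.0, 1.0, 3.0])
B = np.array([5.0, 8.0, 1.0, 3.0])

# Apply the formula: square root of the sum of squared per-sample differences
d_manual = np.sqrt(np.sum((B - A) ** 2))

# np.linalg.norm computes the same Euclidean (L2) distance in one call
d_norm = np.linalg.norm(B - A)

print(d_manual)  # 5.0, since sqrt(9 + 16 + 0 + 0) = 5
print(np.isclose(d_manual, d_norm))
```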
We will subset the data to only the columns we are interested in, then split it into training and test sets.
# install scipy and plotly if you haven't
# !pip install scipy
# !pip install plotly
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import linkage
from matplotlib import pyplot as plt
import plotly.express as px
import seaborn as sns
import math
data = pd.read_csv("all.15k.patients.txt",sep = "\t")
data.head()
# to perform z transformation, use StandardScaler().fit_transform()
size_nodespos_scale = StandardScaler().fit_transform(data[["size", "nodespos"]])
# Substitute the original data with the transformed data
data[["size", "nodespos"]] = size_nodespos_scale
# notice column "size" and "nodespos" have changed
data.head()
# subset columns
specific_data = data[["alivestatus","grade","nodespos","nodesexam","size","pgr","er"]]
# set alivestatus as categorical
specific_data = specific_data.astype({'alivestatus': 'category'})
# randomly choose 5000 rows as test set
data_test = specific_data.sample(5000)
# set difference
whole_index = specific_data.index
test_index = data_test.index
train_index = whole_index.difference(test_index)
# training set: all rows not in the test set
data_train = specific_data.loc[train_index,]
sns.scatterplot(data=specific_data, x="nodespos", y="size", hue="alivestatus")
Now that we can calculate the distance between points, we can group the points that are close together. There are two main clustering methods: hierarchical and K-means. Hierarchical clustering is a bottom-up approach: we first group together the points that are most similar and work our way up. K-means clustering is used when you already know how many clusters you want; cluster centers are placed in the data and each point is assigned to the center it is closest to.
Hierarchical clustering can be constructed in many ways, but the simplest is the UPGMA method (unweighted pair group method with arithmetic mean), which is also used for constructing phylogenetic trees. The steps are: compute all pairwise distances; merge the two closest clusters; recompute the distance from the new cluster to every other cluster as the average of the distances between their members; repeat until everything is in one cluster.
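As a small illustration of UPGMA merging (toy points, made up here), SciPy's `linkage` with `method='average'` implements exactly this averaging rule:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Four toy 2-D points: two tight pairs far apart from each other
pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5]])

# pdist gives the condensed pairwise distance matrix;
# 'average' linkage is the UPGMA rule
Z = linkage(pdist(pts), method='average')

# Each row of Z records one merge:
# [cluster_i, cluster_j, merge_distance, new_cluster_size]
print(Z)
```

The first merge joins points 0 and 1 (distance 1.0), the second joins points 2 and 3 (distance 1.5), and the last merge joins those two pairs at the average of the four cross-pair distances.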
# use sklearn.cluster.AgglomerativeClustering for prediction purpose
X = np.array(specific_data[["nodespos","size"]])
# note: in scikit-learn >= 1.2 the 'affinity' argument is renamed 'metric'
hclustering = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete')
hclustering.fit_predict(X)
# predicted labels:
print("The predicted labels:" , hclustering.labels_)
# drawing dendrogram with sklearn is difficult,
# so we use scipy.cluster.hierarchy.linkage and scipy.cluster.hierarchy.dendrogram for plotting purpose
# it's not practical to draw all 15,000 points here, so we plot the first 100 samples
linked = linkage(X[:100], 'complete')
dendrogram(linked, orientation='top')
plt.show()
For K-means clustering you must first decide on a $k$. In this case let's pick $k = 2$. This means two centers, representing two different groups, will be randomly placed in our dataset, and each point will then be assigned to the group whose center it is closest to.
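The assign-then-update loop described above can be sketched from scratch; the two blobs and the seed below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs of 20 points each
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Start with k=2 centers picked at random from the data
centers = X[rng.choice(len(X), 2, replace=False)]
for _ in range(10):
    # Assignment step: each point joins its nearest center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points
    # (keep a center in place if it happens to own no points)
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(2)])

# The centers should end up near (0, 0) and (5, 5)
print(np.sort(centers[:, 0]).round(1))
```

In practice we use `sklearn.cluster.KMeans`, which does the same thing with smarter initialization and a convergence check.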
X = np.array(specific_data[["nodespos","size"]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
# note: K-means labels 0/1 are arbitrary and may not line up with alivestatus
confusion_matrix(specific_data.alivestatus, kmeans.labels_)
kmean_df = specific_data[["nodespos","size","alivestatus"]].copy()
kmean_df["kmeans"] = kmeans.labels_
# plot with seaborn
sns.scatterplot(data=kmean_df, x="nodespos", y="size", hue="kmeans", style = "alivestatus")
# our new patient
ndata = np.array([0,0])
# centroids coordinates
kmeans.cluster_centers_
# calculate the Euclidean distance from centroids to our new data (0,0)
dist_alive = math.sqrt(sum(np.square(kmeans.cluster_centers_[0,:] - ndata)))
dist_dead = math.sqrt(sum(np.square(kmeans.cluster_centers_[1,:] - ndata)))
# normalize the distance
norm_dist_alive = dist_alive/(dist_alive + dist_dead)
norm_dist_dead = dist_dead/(dist_alive + dist_dead)
# probability = 1 - normalized distance
p_alive = 1 - norm_dist_alive
p_dead = 1 - norm_dist_dead
print("Probability of patient alive: ", p_alive)
print("Probability of patient dead: ", p_dead)
fig = px.scatter_3d(specific_data, x='nodespos', y='size', z='grade',
color='alivestatus')
fig.show()
The strategy of K-nearest neighbors is to simply look at the values closest to a test point and assign the classification accordingly. $k$ stands for the number of closest neighbors it will consult. For example, if $k = 3$, then a test point looks at its $3$ closest neighbors; if $2/3$ are classified as $1$ and $1/3$ as $0$, then the classification of the test point is simply $1$.
So there is no model to create: we simply store the training points and use them to predict the rest.
You can guess that changing $k$ will have a large effect on the classification. The larger the $k$, the smoother the separation between classes, whereas the smaller the $k$, the tighter the boundaries, making the classifier more prone to over-fitting.
Ideally you want to try several $k$ values to see which gives the least error.
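A simple way to do that is a loop over candidate $k$ values, scoring each on the held-out set. It is sketched here on synthetic data (since the loop only cares about the feature and label arrays, not the patient file):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the two patient features and the binary outcome
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit one classifier per candidate k and score it on the held-out set
scores = {}
for k in [1, 3, 5, 11, 25, 51, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", scores[best_k])
```

For a less optimistic estimate you would score each $k$ by cross-validation on the training set instead of reusing the test set.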
X_test = np.array(data_test[["nodespos", "size"]])
y_test = np.array(data_test["alivestatus"])
X_train = np.array(data_train[["nodespos", "size"]])
y_train = np.array(data_train["alivestatus"])
data_knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
predicted = data_knn.predict(X_test)
predicted_proba = data_knn.predict_proba(X_test)
confusion_matrix(y_test, predicted)
fpr, tpr, threshold = roc_curve(y_test, predicted_proba[:,1])
roc_auc = auc(fpr, tpr)
print("AUC of the model is: ", auc(fpr, tpr))
plt.figure()
plt.title('AUC plot')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate (1 - specificity)')
plt.show()
knn3_df = data_test[["nodespos","size","alivestatus"]].copy()
knn3_df["predict"] = predicted
sns.scatterplot(data=knn3_df, x="nodespos", y="size", hue="predict", style = "alivestatus")
data_knn_100 = KNeighborsClassifier(n_neighbors=100).fit(X_train, y_train)
predicted_100 = data_knn_100.predict(X_test)
predicted_proba_100 = data_knn_100.predict_proba(X_test)
confusion_matrix(y_test, predicted_100)
fpr_100, tpr_100, threshold_100 = roc_curve(y_test, predicted_proba_100[:,1])
roc_auc_100 = auc(fpr_100, tpr_100)
print("AUC of the model is: ", auc(fpr_100, tpr_100))
plt.figure()
plt.title('AUC plot (k = 100)')
plt.plot(fpr_100, tpr_100, 'b', label = 'AUC = %0.2f' % roc_auc_100)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate (1 - specificity)')
plt.show()
knn100_df = data_test[["nodespos","size","alivestatus"]].copy()
knn100_df["predict"] = predicted_100
sns.scatterplot(data=knn100_df, x="nodespos", y="size", hue="predict", style = "alivestatus")